Abstract

Survival analysis is often used in cancer studies. It has been shown that combining clinical data with genomic data increases the predictive performance of survival analysis methods. This tool provides a wide range of survival analysis methods for genomics research, especially in cancer studies: Kaplan-Meier, Cox regression, penalized Cox regression and random survival forests. It also offers methods for determining optimal cutoff points for continuous markers.

Each procedure includes the following features:

Kaplan-Meier: descriptive statistics, survival table, mean and median life time, hazard ratios, comparison tests including Log-rank, Gehan-Breslow, Tarone-Ware, Peto-Peto, Modified Peto-Peto, Fleming-Harrington, and interactive plots such as Kaplan-Meier curves and hazard plots.

Cox regression: coefficient estimates, hazard ratios, goodness of fit test, analysis of deviance, save predictions, save residuals, save Martingale residuals, save Schoenfeld residuals, save dfBetas, proportional hazard assumption test, and interactive plots including Schoenfeld residual plot and Log-Minus-Log plot.

Penalized Cox regression: feature selection using ridge, elastic net and lasso penalization, and a cross-validation curve to investigate the relationship between partial likelihood deviance and lambda values.

Random survival forests: overall survival predictions (Nelson-Aalen estimator, overall ensemble), individual survival predictions (with OOB), individual cumulative hazard predictions (with OOB), error rate, variable importance, and interactive plots including random survival plot, cumulative hazard plot, error rate plot and Cox vs RSF plot.

Optimal cutoff: determination of the optimal cutoff value by maximizing test statistics, including log-rank, Gehan-Breslow, Tarone-Ware, Peto-Peto, modified Peto-Peto and Fleming-Harrington.

1. Data upload

This tool requires a dataset in *.txt format, delimited by comma, semicolon, space or tab. The first row of the dataset must contain the header. When an appropriate file is uploaded, the dataset appears immediately on the main page of the tool. Alternatively, users can load one of the example datasets provided within the tool to test it and understand how it works.
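
To check a file before uploading it, the expected layout (a header row followed by delimited values) can be sketched in Python as below. The delimiter guess is a simple heuristic chosen for this illustration, and `read_delimited` and the sample data are hypothetical; this is not the tool's own parsing logic.

```python
import csv
import io

# The four delimiters accepted by the tool, per the description above.
CANDIDATE_DELIMITERS = [",", ";", "\t", " "]

def read_delimited(text):
    """Parse delimited text whose first row is a header.

    The delimiter is guessed as the candidate that occurs most often
    in the header line (an assumption made for this sketch).
    """
    header_line = text.splitlines()[0]
    delimiter = max(CANDIDATE_DELIMITERS, key=header_line.count)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return delimiter, list(reader)

# Hypothetical three-column dataset: survival time, status, factor group.
sample = "time;status;group\n5;1;A\n8;0;B\n"
delimiter, rows = read_delimited(sample)
print(repr(delimiter), rows[0])
```

Values are returned as strings here; numeric columns such as survival time would still need converting before analysis.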

2. Analysis Methods

2.1. Kaplan-Meier

Concept

Kaplan-Meier is a non-parametric statistical method used to estimate survival probabilities and hazard ratios for a cohort study group. In clinical trials, it is often used to measure the fraction of patients living for a certain period of time after a treatment.
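
As an illustration of the estimator, the sketch below computes the product-limit survival probability at each distinct event time in plain Python. The function name and the example data are hypothetical, and the tool's own implementation may differ in details such as tie handling.

```python
def kaplan_meier(times, events):
    """Return (time, n_at_risk, n_events, survival) at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event of interest occurred, 0 if censored
    """
    table = []
    survival = 1.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n_at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        # Product-limit step: multiply by the fraction surviving past t.
        survival *= 1.0 - n_events / n_at_risk
        table.append((t, n_at_risk, n_events, survival))
    return table

# Hypothetical data: six subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
table = kaplan_meier(times, events)
for row in table:
    print(row)
```

Censored subjects contribute to the number at risk until their censoring time but never count as events, which is why the estimate only drops at event times.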

Variables

  • Survival time: Time until an event occurs (e.g. days, weeks, months, years)
  • Status variable: The event (e.g. death, disease, remission, recovery)
  • Category value for status variable: Category value of the event of interest (e.g. 1, yes)
  • Factor variable: A categorical variable which indicates different study groups (e.g. treatment, gender)

Usage

A Kaplan-Meier analysis can be conducted by applying the following steps:

  1. Select the analysis method as Kaplan Meier from Analysis tab.
  2. Select suitable variables for the analysis, such as survival time, status variable, category value for status variable and factor variable, if one exists.
  3. In advanced options, one can change the confidence interval type (log, log-log or plain), the variance estimation method (Greenwood or Tsiatis), the Fleming-Harrington weights, the confidence level and the reference category (first or last).
  4. Click Run button to run the analysis.

Outputs

Desired outputs can be selected by clicking the Outputs checkbox. Available outputs are:

a. Case summary

Summary statistics, such as number and percent of observations, events and censored cases can be obtained.

b. Survival table

A survival table can be created. The first column in the table represents the factor group and the index of the time point (e.g. 1.2 means the second time point in the first factor group; likewise, 2.1 means the first time point in the second group). The second column is the survival time, the third column gives the number of subjects at risk, the fourth column is the number of events, and the fifth column represents the cumulative probability of surviving. The sixth, seventh and eighth columns are the associated standard error, lower limit and upper limit, respectively.
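
The columns described above can be reproduced for a single group with the following sketch. It assumes the Greenwood variance estimator and the plain confidence interval type, both of which appear among the advanced options; the function name and data are hypothetical, and the tool's output may differ in rounding or edge cases.

```python
from math import sqrt
from statistics import NormalDist

def survival_table(times, events, conf_level=0.95):
    """Kaplan-Meier table with Greenwood standard errors and plain limits.

    Each row is (time, n_at_risk, n_events, survival, se, lower, upper).
    """
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    rows, survival, var_sum = [], 1.0, 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - d / n
        if n == d:
            # Everyone still at risk had the event: Greenwood's sum is undefined.
            rows.append((t, n, d, survival, None, None, None))
            continue
        var_sum += d / (n * (n - d))
        se = survival * sqrt(var_sum)
        lower = max(0.0, survival - z * se)  # plain interval, clipped to [0, 1]
        upper = min(1.0, survival + z * se)
        rows.append((t, n, d, survival, se, lower, upper))
    return rows

# Hypothetical data for one factor group.
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 0]
rows = survival_table(times, events)
for row in rows:
    print(row)
```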

c. Survival plot

A forest plot can be created for each level of factor group using survival probabilities at each end point.

d. Mean and Median life time

Mean and median life time and their associated confidence limits can be calculated for each level of factor group.

e. Hazard ratio

Hazard ratios and their respective lower and upper limits can be calculated for each factor group at each end point.

f. Hazard plot

A forest plot can be created for each level of factor group using hazard ratios at each end point.

g. Comparison tests

Six different comparison tests can be calculated for testing the differences in survival probability estimations between factor groups.
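
Several of these tests share one form and differ only in the weight given to each event time. The sketch below shows a two-group weighted log-rank statistic with three of the weights (log-rank, Gehan-Breslow, Tarone-Ware). Peto-Peto, modified Peto-Peto and Fleming-Harrington weights depend on a pooled survival estimate and are omitted here. The function name and data are hypothetical, and this is an illustration rather than the tool's implementation.

```python
from math import erfc, sqrt

# Weight as a function of the total number at risk at an event time.
WEIGHTS = {
    "log-rank": lambda n: 1.0,
    "gehan-breslow": lambda n: float(n),
    "tarone-ware": lambda n: sqrt(n),
}

def weighted_logrank(times, events, groups, test="log-rank"):
    """Two-group weighted log-rank test; returns (chi_square, p_value)."""
    weight = WEIGHTS[test]
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    first = labels[0]
    u = v = 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = [(g, s, e) for s, e, g in zip(times, events, groups) if s >= t]
        n = len(at_risk)
        n1 = sum(1 for g, _, _ in at_risk if g == first)
        d = sum(1 for _, s, e in at_risk if s == t and e == 1)
        d1 = sum(1 for g, s, e in at_risk if g == first and s == t and e == 1)
        w = weight(n)
        u += w * (d1 - n1 * d / n)  # observed minus expected events in group 1
        if n > 1:
            v += w**2 * n1 * (n - n1) * d * (n - d) / (n**2 * (n - 1))
    if v == 0:
        raise ValueError("variance is zero; the statistic is undefined")
    chi_square = u**2 / v
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi_square, erfc(sqrt(chi_square / 2))

# Hypothetical data: four subjects in two groups, all with events.
times = [1, 2, 3, 4]
events = [1, 1, 1, 1]
groups = ["A", "B", "A", "B"]
chi_square, p_value = weighted_logrank(times, events, groups, "log-rank")
print(chi_square, p_value)
```

Weights that grow with the number at risk, such as Gehan-Breslow, emphasise early differences between the curves, whereas the unweighted log-rank test treats all event times equally.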

h. Plots

i. Kaplan-Meier curve

Kaplan-Meier curves can be created. A number of edit options are also available for plots.

j. Hazard plot

A hazard plot can be created. A number of edit options are also available for plots.

k. Log-Minus-Log plot

A Log-Minus-Log plot can be created. A number of edit options are also available for plots.

2.2. Cox Regression

Concept

Cox regression, also known as proportional hazards regression, is a method to investigate the effect of one or more factors (e.g. gene expressions) on the time until an event of interest occurs. In this model, the effect of a unit increase in a factor is multiplicative with respect to the hazard rate.
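
For a single covariate x with coefficient beta, the model assumes the hazard is a baseline hazard multiplied by exp(beta * x), so a unit increase in x multiplies the hazard by exp(beta). The sketch below evaluates the partial log-likelihood that Cox regression maximizes, in its Breslow form. The function name and data are hypothetical, and no optimizer is included, so this is only a way to see what is being fitted.

```python
from math import exp, log

def cox_partial_loglik(beta, times, events, x):
    """Breslow-type partial log-likelihood for a single covariate x."""
    total = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i != 1:
            continue  # censored subjects only appear in the risk sets
        # Sum of relative hazards over everyone still at risk at t_i.
        risk_sum = sum(exp(beta * x_j) for t_j, x_j in zip(times, x) if t_j >= t_i)
        total += beta * x_i - log(risk_sum)
    return total

# Hypothetical data: subjects with x = 1 fail earlier than those with x = 0.
times = [1, 2, 3, 4]
events = [1, 1, 1, 1]
x = [1, 1, 0, 0]
loglik_at_zero = cox_partial_loglik(0.0, times, events, x)
loglik_at_one = cox_partial_loglik(1.0, times, events, x)
print(loglik_at_zero, loglik_at_one)
```

In this example the likelihood is higher at beta = 1 than at beta = 0, reflecting the positive association between x and the hazard.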

Usage

A Cox regression analysis can be conducted by applying the following steps:

  1. Select the analysis method as Cox Regression from Analysis tab.
  2. Select suitable variables for the analysis, such as survival time, status variable, category value for status variable, and categorical and continuous predictors for the model.
  3. In advanced options, interaction terms, strata terms and time dependent covariates can be added to the model. Moreover, if there are multiple records for observations, users can specify it by clicking the Multiple ID checkbox. Furthermore, one can choose the model selection criterion (AIC or p-value), the model selection method (backward, forward or stepwise), the reference category (first or last) and the ties method (Efron, Breslow or exact), and change the confidence level.
  4. Click Run button to run the analysis.

Outputs

Desired outputs can be selected by clicking Outputs checkbox. Available outputs are coefficient estimates, hazard ratio, goodness of fit tests, analysis of deviance, predictions, residuals, Martingale residuals, Schoenfeld residuals and DfBetas.

a. Coefficient Estimates

A coefficient estimation table, which includes variable names, coefficient estimates and their associated standard errors, z statistics and p values, can be created.

b. Hazard ratio

A hazard ratio table, which includes variable names, hazard ratios and their associated lower and upper limits, can be created.
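
The link between the coefficient table and the hazard ratio table can be sketched as follows, assuming Wald-type limits computed on the coefficient scale and then exponentiated. The function name and the numeric inputs are hypothetical.

```python
from math import exp, log
from statistics import NormalDist

def hazard_ratio(coef, se, conf_level=0.95):
    """Return (hazard ratio, lower limit, upper limit) for one coefficient."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    return exp(coef), exp(coef - z * se), exp(coef + z * se)

# Hypothetical coefficient estimate and standard error.
hr, lower, upper = hazard_ratio(coef=log(2), se=0.25)
print(round(hr, 3), round(lower, 3), round(upper, 3))
```

Because the limits are built on the log scale, the interval is not symmetric around the hazard ratio itself.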

c. Hazard plot

A forest plot can be created for hazard ratios to allow a visual inspection.

d. Goodness of Fit Tests

Fitted Cox regression model can be tested with three tests: Likelihood ratio, Wald, Score.

e. Analysis of Deviance

A deviance analysis can be conducted for each variable in the fitted model.

f. Predictions

Predictions from the fitted model can be obtained.

g. Residuals

Residuals from the fitted model can be obtained.

h. Martingale Residuals

Martingale residuals from the fitted model can be obtained.

i. Schoenfeld Residuals

Schoenfeld residuals from the fitted model can be obtained.

j. DfBetas

DfBetas residuals from the fitted model can be obtained.

k. Proportional Hazard Assumption

l. Proportional Hazard Test

To check the proportionality assumption of Cox regression model, a proportional hazard test can be conducted both globally and for each variable in the fitted model.

m. Schoenfeld Plot

Besides a formal test of the proportionality assumption, a Schoenfeld plot can be created to check the assumption visually.

n. Log-Minus-Log Plot

Another useful plot for checking proportionality assumption is log-minus-log plot. Lines should be parallel to each other to satisfy proportionality.
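
The reason parallel lines indicate proportionality can be seen in a short sketch. If two groups have proportional hazards with ratio HR, their survival functions satisfy S1(t) = S0(t)^HR, so log(-log S1) and log(-log S0) differ by the constant log(HR). The survival values below are hypothetical.

```python
from math import log

def log_minus_log(survival):
    """Transform survival probabilities (strictly between 0 and 1)."""
    return [log(-log(s)) for s in survival]

# Hypothetical baseline survival curve and a group with hazard ratio 2.
baseline = [0.9, 0.7, 0.5, 0.3]
hazard_ratio = 2.0
exposed = [s**hazard_ratio for s in baseline]

gaps = [b - a for a, b in zip(log_minus_log(baseline), log_minus_log(exposed))]
print(gaps)  # every gap equals log(2) up to floating-point error
```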

2.3. Penalized Cox Regression

Concept

Feature selection is a useful strategy to avoid over-fitting, to obtain more reliable predictive results, and to provide more insight into the underlying causal relationships (Ma and Huang, 2008). In this section, feature selection can be performed using a ridge, elastic net or lasso penalty, especially when there are far more predictors than observations (n << p). More information can be found in Zou and Hastie, 2005, Friedman et al., 2008 and Simon et al., 2011.

Usage

A Penalized Cox regression analysis can be conducted by applying the following steps:

  1. Select the analysis method as Penalized Cox Regression from Analysis tab.
  2. Select suitable variables for the analysis, such as survival time and status variable.
  3. If all predictors are continuous, check the Select All Variables option to include all variables in the dataset in the feature selection process. If some predictors are categorical and others are continuous, uncheck the Select All Variables option and select categorical and continuous variables separately.
  4. Define the penalty term using the Penalty term slider as follows:

  • Penalty term = 0: ridge penalty
  • 0 < Penalty term < 1: elastic net penalty
  • Penalty term = 1: lasso penalty

  5. Select the number of folds for cross-validation. Note that the number of folds must be greater than 3.
  6. Click Run button to run the analysis.
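
The role of the penalty term in step 4 can be illustrated with the elastic net penalty in the parameterization of Friedman et al. and Simon et al., where a mixing parameter between 0 and 1 blends the ridge and lasso penalties. That the slider corresponds to this mixing parameter (called `alpha` below) is an assumption of this sketch, as are the function name and the coefficient values.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2).

    alpha = 0 gives the ridge penalty, alpha = 1 the lasso penalty.
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)

# Hypothetical coefficient vector evaluated at the three settings of the slider.
coefs = [1.0, -2.0]
penalties = {alpha: elastic_net_penalty(coefs, lam=1.0, alpha=alpha)
             for alpha in (0.0, 0.5, 1.0)}
print(penalties)
```

The lasso end of the range can shrink coefficients exactly to zero and so selects variables, whereas the ridge end shrinks them without removing any.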

Outputs

a) Variables in the model

Variable selection is conducted with the selected penalization method (i.e. ridge, elastic net or lasso) and the results are displayed as a table, which includes the selected variables and their associated coefficient estimates.

b) Cross-validation curve

A cross-validation curve can be created to investigate the relationship between partial likelihood deviance and lambda values.

2.4. Random Survival Forests

Concept

Random survival forests (RSF), an ensemble method for analysing right-censored data, were first introduced by Ishwaran et al., 2008. RSF has several advantages over Cox regression: (i) unlike Cox regression, RSF does not rely on the proportional hazards assumption; (ii) RSF accounts for nonlinear effects and interactions for factor variables.

Usage

A random survival forests analysis can be conducted by applying the following steps:

  1. Select the analysis method as Random Survival Forests from Analysis tab.
  2. Select suitable variables for the analysis, such as survival time, status variable, category value for status variable, and categorical and continuous predictors for the model.
  3. In advanced options, interaction terms, strata terms and time dependent covariates can be added to the model. Moreover, if there are multiple records for observations, users can specify it by clicking the Multiple ID checkbox. From RSF options, the following can be adjusted: number of trees, bootstrap method, number of randomly selected variables, minimum number of cases in a terminal node, maximum depth of a tree, splitting rule, number of splits, handling of missing values, number of iterations of the missing data algorithm, proximity of cases, size of bootstrap and type of bootstrap.
  4. Click Run button to run the analysis.

Outputs

a. Individual Survival Predictions

Survival predictions for each observation can be obtained. In this table, rows represent observations whereas columns represent time endpoints.

b. Individual Survival Predictions OOB

Out of bag (OOB) survival predictions for each observation can be obtained. In this table, rows represent observations whereas columns represent time endpoints.
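
The idea behind OOB predictions can be sketched as follows: each tree is grown on a bootstrap sample of the cases, and the cases not drawn for that tree are its out-of-bag cases. An OOB prediction for a case uses only the trees for which that case was out of bag. The sketch assumes sampling with replacement at the full sample size, which is one of several bootstrap settings; the function name is hypothetical.

```python
import random

def bootstrap_split(n_cases, seed=0):
    """Draw one bootstrap sample and return (in_bag, out_of_bag) case indices."""
    rng = random.Random(seed)
    # Sample n_cases indices with replacement; some cases repeat, some are missed.
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag

in_bag, out_of_bag = bootstrap_split(10, seed=1)
print(sorted(in_bag), out_of_bag)
```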


c. Individual Cumulative Hazard Predictions

Cumulative hazard predictions for each observation can be obtained. In this table, rows represent observations whereas columns represent time endpoints.


d. Individual Cumulative Hazard Predictions OOB

Out of bag (OOB) cumulative hazard predictions for each observation can be obtained. In this table, rows represent observations whereas columns represent time endpoints.


e. Error Rate

An error rate table, which shows error rate estimations for each tree, can be obtained.


f. Variable Importance

A variable importance table as well as an interactive plot, which shows relative importance of variables in fitted model, can be obtained.


g. Overall Survival Plot

A survival plot can be created based on Nelson-Aalen estimator and overall ensemble predictions.
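
For reference, the Nelson-Aalen estimator accumulates the ratio of events to subjects at risk at each event time, and a survival curve can be derived from it as exp(-H(t)). The sketch below uses hypothetical data and a hypothetical function name; how the tool combines this estimator with the ensemble predictions is not shown.

```python
from math import exp

def nelson_aalen(times, events):
    """Return (time, cumulative_hazard, survival) at each distinct event time."""
    rows, cumulative_hazard = [], 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n_at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        cumulative_hazard += n_events / n_at_risk
        rows.append((t, cumulative_hazard, exp(-cumulative_hazard)))
    return rows

# Hypothetical data: six subjects, two of them censored.
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
rows = nelson_aalen(times, events)
for row in rows:
    print(row)
```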

h. Individual Random Survival Plot

A survival plot can be drawn for survival predictions from random survival forests model. Each line represents a survival curve for each observation.


i. Individual Survival OOB Plot

A survival plot can be drawn for OOB survival predictions from random survival forests model. Each line represents a survival curve for each observation.


j. Individual Cumulative Hazard Plot

A cumulative hazard plot can be drawn for cumulative hazard predictions from the random survival forests model. Each line represents the cumulative hazard curve of one observation.


k. Individual Cumulative Hazard OOB Plot

A cumulative hazard plot can be drawn for OOB cumulative hazard predictions from the random survival forests model. Each line represents the cumulative hazard curve of one observation.


l. Error Rate Plot

An interactive error rate plot, which shows how the error rate changes as the number of trees increases, can be drawn.


m. Cox vs RSF

A Cox model can be compared to random survival forests model through an interactive plot for visual inspection of both models.

2.5. Optimal Cutoff

Concept

To investigate whether the higher or lower expressions of differentially expressed genes lead to more survival risks for patients, expression levels of genes can be dichotomized based on certain cutoff values by maximizing certain test statistics.
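
A minimal sketch of this idea is given below: each candidate cutoff splits the subjects into low and high groups, a log-rank statistic is computed for the split, and the cutoff with the largest statistic is kept. The function names, the `min_group_size` guard and the data are hypothetical, only the unweighted log-rank statistic is shown, and the tool's own search may differ.

```python
def logrank_chi_square(times, events, high):
    """Log-rank chi-square statistic comparing the high group with the low group."""
    u = v = 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        risk = [(s, e, h) for s, e, h in zip(times, events, high) if s >= t]
        n = len(risk)
        n1 = sum(1 for _, _, h in risk if h)
        d = sum(1 for s, e, _ in risk if s == t and e == 1)
        d1 = sum(1 for s, e, h in risk if h and s == t and e == 1)
        u += d1 - n1 * d / n  # observed minus expected events in the high group
        if n > 1:
            v += n1 * (n - n1) * d * (n - d) / (n**2 * (n - 1))
    return u * u / v if v > 0 else 0.0

def optimal_cutoff(marker, times, events, min_group_size=2):
    """Return (cutoff, statistic) maximizing the log-rank statistic."""
    best_cut, best_stat = None, -1.0
    for cut in sorted(set(marker))[:-1]:
        high = [m > cut for m in marker]
        # Skip splits that leave either group too small to compare.
        if min(sum(high), len(high) - sum(high)) < min_group_size:
            continue
        stat = logrank_chi_square(times, events, high)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat

# Hypothetical data: subjects with marker values above 4 fail much earlier.
marker = [1, 2, 3, 4, 5, 6, 7, 8]
times = [20, 18, 19, 17, 5, 4, 6, 3]
events = [1, 1, 1, 1, 1, 1, 1, 1]
best_cut, best_stat = optimal_cutoff(marker, times, events)
print(best_cut, best_stat)
```

Subjects with a marker value above the returned cutoff form the high expression group, and the rest form the low expression group.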

Usage

An optimal cutoff value can be determined by applying the following steps:

  1. Select the analysis method as Optimal Cutoff from Analysis tab.
  2. Select one or more markers from the Select marker(s) box.
  3. Select the appropriate Survival time and Status variable.
  4. Select a category for the status variable.
  5. Select a test statistic (log-rank, Gehan-Breslow, Tarone-Ware, Peto-Peto, Modified Peto-Peto, Fleming-Harrington) for optimal cutoff determination.
  6. In advanced options, one can change the confidence interval type (log, log-log or plain), the variance estimation method (Greenwood or Tsiatis), the Fleming-Harrington weights and the confidence level.
  7. Click Run button to run the analysis.

Outputs

a) Optimal cutoff value(s)

An optimal cutoff value can be obtained, as well as the hazard ratio (HR) with its confidence interval, the mean survival time for low and high gene expression levels, and the p value for the selected significance test.

b) Kaplan-Meier plot

A Kaplan-Meier plot can be created after dichotomizing the gene expression level as high or low.